Exploratory analysis using Xorbits over NYC taxi dataset¶

Xorbits is a powerful tool for exploring and analyzing large datasets. One of the classic datasets for demonstrating the capabilities of Xorbits is the NYC taxi dataset, which contains records of taxi rides in the city from 2009 to 2022. In this blog, we will explore how to use Xorbits to do some initial exploration of the NYC taxi dataset and get a sense of what kind of insights we might be able to gain from the data.

Software versions¶

  • Xorbits==0.1.0
  • plotly==5.11.0

Datasets¶

  • TLC Trip Record Data
  • NYC Taxi Zones

Initialize¶

The first step is to initialize Xorbits:

In [1]:
import xorbits

# Initialize Xorbits in the local environment.
xorbits.init()

Data loading¶

The second step is to load the data into an Xorbits DataFrame. This can be done using the read_parquet() function, which allows us to specify the location of the parquet file and any additional options we want to use while reading the data.

In the case of the NYC taxi dataset, here is an example of how we could do this using Xorbits:

In [2]:
import datetime
import json
import xorbits.pandas as pd

trips = pd.read_parquet("path/to/yellow_tripdata_2022-*.parquet")
# Remove outliers
trips = trips[(trips['tpep_pickup_datetime'] >= datetime.datetime(2022, 1, 1)) & (trips['tpep_pickup_datetime'] <= datetime.datetime(2022, 12, 31))]

taxi_zones = pd.read_csv('path/to/taxi+_zone_lookup.csv')

with open('path/to/taxi_zones.geojson') as fd:
    geojson = json.load(fd)

Once we have the data loaded into a DataFrame, we might want to get a sense of the overall structure of the data by looking at the number of rows and columns, the data types of each column, and the first few rows of the data. We can do this using the shape, dtypes, and head() attributes, respectively:

In [3]:
print(trips.shape)
print(trips.dtypes)
print(trips.head())
(nan, 19)
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
airport_fee                     float64
dtype: object
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2022-10-01 00:03:41   2022-10-01 00:18:39              1.0   
1         2  2022-10-01 00:14:30   2022-10-01 00:19:48              2.0   
2         2  2022-10-01 00:27:13   2022-10-01 00:37:41              1.0   
3         1  2022-10-01 00:32:53   2022-10-01 00:38:55              0.0   
4         1  2022-10-01 00:44:55   2022-10-01 00:50:21              0.0   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           1.70         1.0                  N           249           107   
1           0.72         1.0                  N           151           238   
2           1.74         1.0                  N           238           166   
3           1.30         1.0                  N           142           239   
4           1.00         1.0                  N           238           166   

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0             1          9.5    3.0      0.5        2.65           0.0   
1             2          5.5    0.5      0.5        0.00           0.0   
2             1          9.0    0.5      0.5        2.06           0.0   
3             1          6.5    3.0      0.5        2.05           0.0   
4             1          6.0    0.5      0.5        1.80           0.0   

   improvement_surcharge  total_amount  congestion_surcharge  airport_fee  
0                    0.3         15.95                   2.5          0.0  
1                    0.3          9.30                   2.5          0.0  
2                    0.3         12.36                   0.0          0.0  
3                    0.3         12.35                   2.5          0.0  
4                    0.3          9.10                   0.0          0.0  

We can also use the describe() method to get a summary of the statistical properties of each numerical column in the dataset.

In [4]:
print(trips.describe())
           VendorID  passenger_count  trip_distance    RatecodeID  \
count  3.300328e+07     3.188340e+07   3.300328e+07  3.188340e+07   
mean   1.716089e+00     1.400064e+00   6.051177e+00  1.426007e+00   
std    4.862856e-01     9.669177e-01   6.135931e+02  5.824383e+00   
min    1.000000e+00     0.000000e+00   0.000000e+00  1.000000e+00   
25%    1.000000e+00              NaN   1.100000e+00           NaN   
50%    2.000000e+00              NaN   1.900000e+00           NaN   
75%    2.000000e+00              NaN   3.550000e+00           NaN   
max    6.000000e+00     9.000000e+00   3.896785e+05  9.900000e+01   

       PULocationID  DOLocationID  payment_type   fare_amount         extra  \
count  3.300328e+07  3.300328e+07  3.300328e+07  3.300328e+07  3.300328e+07   
mean   1.646682e+02  1.624247e+02  1.188556e+00  1.459878e+01  1.008178e+00   
std    6.537942e+01  7.027724e+01  5.121790e-01  1.047507e+02  1.246117e+00   
min    1.000000e+00  1.000000e+00  0.000000e+00 -2.564000e+03 -2.218000e+01   
25%    1.320000e+02  1.130000e+02  1.000000e+00  7.000000e+00  0.000000e+00   
50%    1.620000e+02  1.620000e+02  1.000000e+00  1.000000e+01  5.000000e-01   
75%    2.340000e+02  2.340000e+02  1.000000e+00  1.600000e+01  2.500000e+00   
max    2.650000e+02  2.650000e+02  5.000000e+00  4.010923e+05  3.350000e+01   

            mta_tax    tip_amount  tolls_amount  improvement_surcharge  \
count  3.300328e+07  3.300328e+07  3.300328e+07           3.300328e+07   
mean   4.891569e-01  2.707103e+00  5.272037e-01           2.961781e-01   
std    9.192201e-02  3.247909e+00  2.023945e+00           4.730772e-02   
min   -5.500000e-01 -4.100000e+02 -9.999000e+01          -3.000000e-01   
25%    5.000000e-01  9.300000e-01  0.000000e+00           3.000000e-01   
50%    5.000000e-01  2.150000e+00  0.000000e+00           3.000000e-01   
75%    5.000000e-01  3.350000e+00  0.000000e+00           3.000000e-01   
max    2.548000e+01  1.400160e+03  9.118700e+02           1.000000e+00   

       total_amount  congestion_surcharge   airport_fee  
count  3.300328e+07          3.188340e+07  3.188340e+07  
mean   2.133517e+01          2.282358e+00  9.578618e-02  
std    1.052695e+02          7.487687e-01  3.357228e-01  
min   -2.567800e+03         -2.500000e+00 -1.250000e+00  
25%    1.230000e+01                   NaN           NaN  
50%    1.595000e+01                   NaN           NaN  
75%    2.274000e+01                   NaN           NaN  
max    4.010956e+05          2.750000e+00  1.250000e+00  

Time series analysis¶

One way to analyze the NYC taxi dataset is to look at how the number of rides varies over time. We can do this by creating a new column in the DataFrame that represents the pick-up date of each ride, and then use the groupby method to group the data by month or year and compute the count of rides for each group:

In [5]:
trips['PU_date'] = trips['tpep_pickup_datetime'].dt.date
count = trips.groupby('PU_date', as_index=False).agg(count=('VendorID', 'count'))
print(count)
        PU_date   count
0    2022-01-01   63441
1    2022-01-02   58421
2    2022-01-03   72405
3    2022-01-04   74562
4    2022-01-05   74592
..          ...     ...
300  2022-10-28  132215
301  2022-10-29  127966
302  2022-10-30  108386
303  2022-10-31  101467
304  2022-11-01       9

[305 rows x 2 columns]

We can then use a library like plotly to visualize the time series data:

In [6]:
import plotly.express as px

b = px.bar(count.to_pandas(), x='PU_date', y='count')
b.show()

Spatial analysis¶

Another way to analyze the NYC taxi dataset is to look at patterns in the spatial distribution of rides. Taking Manhattan as an example, we firstly filter the dataframe by pick-up location ID:

In [7]:
manhattan_zones = taxi_zones[taxi_zones['Borough'] == 'Manhattan']['LocationID']
manhattan_trips = trips[trips['PULocationID'].isin(manhattan_zones)]
print(manhattan_trips)
         VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0               1  2022-10-01 00:03:41   2022-10-01 00:18:39              1.0   
1               2  2022-10-01 00:14:30   2022-10-01 00:19:48              2.0   
2               2  2022-10-01 00:27:13   2022-10-01 00:37:41              1.0   
3               1  2022-10-01 00:32:53   2022-10-01 00:38:55              0.0   
4               1  2022-10-01 00:44:55   2022-10-01 00:50:21              0.0   
...           ...                  ...                   ...              ...   
3588288         2  2022-05-31 23:07:04   2022-05-31 23:19:19              NaN   
3588289         2  2022-05-31 23:15:58   2022-05-31 23:24:08              NaN   
3588291         2  2022-05-31 23:52:12   2022-06-01 00:06:40              NaN   
3588292         2  2022-05-31 23:27:00   2022-05-31 23:48:00              NaN   
3588294         2  2022-05-31 23:00:53   2022-05-31 23:07:54              NaN   

         trip_distance  RatecodeID store_and_fwd_flag  PULocationID  \
0                 1.70         1.0                  N           249   
1                 0.72         1.0                  N           151   
2                 1.74         1.0                  N           238   
3                 1.30         1.0                  N           142   
4                 1.00         1.0                  N           238   
...                ...         ...                ...           ...   
3588288           2.59         NaN               None           164   
3588289           1.86         NaN               None           114   
3588291           2.91         NaN               None           144   
3588292           3.96         NaN               None           161   
3588294           1.43         NaN               None           238   

         DOLocationID  payment_type  fare_amount  extra  mta_tax  tip_amount  \
0                 107             1         9.50    3.0      0.5        2.65   
1                 238             2         5.50    0.5      0.5        0.00   
2                 166             1         9.00    0.5      0.5        2.06   
3                 239             1         6.50    3.0      0.5        2.05   
4                 166             1         6.00    0.5      0.5        1.80   
...               ...           ...          ...    ...      ...         ...   
3588288           231             0        13.52    0.0      0.5        3.73   
3588289           107             0        10.88    0.0      0.5        1.57   
3588291           256             0        11.68    0.0      0.0        2.51   
3588292           145             0        14.54    0.0      0.5        0.00   
3588294            41             0        10.05    0.0      0.5        2.21   

         tolls_amount  improvement_surcharge  total_amount  \
0                 0.0                    0.3         15.95   
1                 0.0                    0.3          9.30   
2                 0.0                    0.3         12.36   
3                 0.0                    0.3         12.35   
4                 0.0                    0.3          9.10   
...               ...                    ...           ...   
3588288           0.0                    0.3         20.55   
3588289           0.0                    0.3         15.75   
3588291           0.0                    0.3         16.99   
3588292           0.0                    0.3         17.84   
3588294           0.0                    0.3         15.56   

         congestion_surcharge  airport_fee     PU_date  
0                         2.5          0.0  2022-10-01  
1                         2.5          0.0  2022-10-01  
2                         0.0          0.0  2022-10-01  
3                         2.5          0.0  2022-10-01  
4                         0.0          0.0  2022-10-01  
...                       ...          ...         ...  
3588288                   NaN          NaN  2022-05-31  
3588289                   NaN          NaN  2022-05-31  
3588291                   NaN          NaN  2022-05-31  
3588292                   NaN          NaN  2022-05-31  
3588294                   NaN          NaN  2022-05-31  

[29403245 rows x 20 columns]

Then use the groupby method to group the data by pick-up location ID and compute the count of rides for each group:

In [8]:
gb_pu_location = manhattan_trips.groupby(['PULocationID'], as_index=False).agg(count=('VendorID', 'count')).to_pandas()

We can then use a library like plotly to visualize the spatial distribution of rides:

In [ ]:
m = px.choropleth(
    gb_pu_location,
    geojson=geojson,
    locations='PULocationID',
    featureidkey='properties.location_id',
    color='count',
    color_continuous_scale="Viridis",
    range_color=(0, gb_pu_location['count'].max()),
    labels={'count':'trips count'}
)
m.update_geos(fitbounds="locations", visible=False)
m.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
m.show()